ABSTRACT


This paper proposes an ontological semantic approach to describing workflow control patterns, research
workflow step patterns, and the meaning of workflows in terms of domain knowledge. The approach provides
wide opportunities for semantic refinement, reuse, and composition of workflows. Automatic reasoning
allows verifying those compositions and implementations and provides machine-actionable workflow
manipulation and problem-solving using workflows. The described approach can take into account the
implementation of workflows in different workflow management systems, the organization of workflow
collections in data infrastructures and the search for them, the semantic selection of workflows and
resources in the research domain, the creation of research step patterns and their implementation by
reusing fragments of existing workflows, and the automation of problem-solving based on the reuse of
workflows. The application of the approach to the CWFR conceptions is proposed.


1. INTRODUCTION 


Providing for the reuse of data includes organizing the processes of their processing and analysis by both
humans and machines. Workflows are applied to describe complex processes and control their recurring
execution for data processing and analysis, for instance in research problem-solving or experimenting. The
same workflows can be used to solve similar problems in different situations by different researchers in
communities and on the same or different data. Thus, once developed, a workflow can be reused multiple
times, thereby serving the reuse of research data and the reproduction of research results.


The guiding principles of FAIR data [1] serve to provide the reuse of data and may be supported by 
accompanying the data with workflows suitable for their processing. In this case, the FAIR data principles 
should be fully applied to processed data and related workflows as a kind of data. To support the FAIR 
properties of data, it is necessary that all metadata and resources referred to the data during their lifecycle 
and applied to them comply with the principles of FAIR data. After all, the FAIR data guidelines are equally 
guiding for managing research workflows since they are a type of data, a type of metadata, and in their 
turn require being described by the metadata. This means that workflows should be findable (F), accessible 
(A), interoperable (I), and reusable (R). 


To manage the lifecycle of research problem-solving and ensure the reuse and reproducibility of research
results, it is important to provide the search for relevant workflows, their interoperability in data
infrastructures, and correct implementations of research processes.


The conceptions of the Canonical Workflow Framework for Research (CWFR) [2,3] include the creation
of canonical steps of activities for typical research approaches common to most domains. Research patterns 
consist of canonical steps. The step descriptions are available from the libraries of canonical steps. For 
different contexts, domains, and communities, there are libraries of specialized packages for certain step 
patterns. 


The interoperability of workflows is supported by using FAIR digital objects [4]. During the execution of 
each step of the workflow, a digital object is formed, labeled with a unique persistent identifier, defined by 
the type, described by attributes, and having a state contributed by the completed steps. Access to any state 
is possible via digital object identifiers. The use of canonical workflow step patterns for solving research
problems makes it possible to standardize the research process, ensure the necessary formal steps, and
simplify the reuse of workflows and processed data. In addition, workflow step patterns and the typing of
steps in specific domains give the machine a good hint on how to process data.


The principles of FAIR data declare machine-actionability from the outset, and this property should be
applied to workflows as well. According to this principle, the reuse of workflows does not merely mean that
a workflow, once developed by a human, can be reused by a machine to solve the same problem. It means that a
machine that has not yet worked on the problem should be able to semantically analyze the problem, find
relevant data, and find a way to obtain the necessary result, possibly by creating a new workflow for solving
the problem. The machine can then solve the problem and publish both the obtained results and the created
tools in such a way that both humans and machines can find and apply them for further reuse or to reproduce
the results.


Semantic approaches for these purposes can be based on the use of domain ontologies and special ontologies.
The role of ontologies in semantic approaches to describing workflows was emphasized in works related
to the myExperiment research community [5] and later Research Objects (RO) [6]. The latter remain popular
and useful today. The ideas used in this study develop and adapt the proposals presented
in [7].


The myExperiment project was popular in some communities since it offered libraries of domain-specific
services and workflows findable by keywords. Its weakness was that most of the available
workflows shared by researchers did not implement domain methods but only provided access to specific data
resources. The project was still useful but faded in popularity because other tools, such as Python
libraries, were more targeted at solving problems in various domains. Jupyter Notebook became popular as well
because Python was popular, not because it is particularly good for sharing and reusing resources. Similarly,
GitHub is not a comfortable research infrastructure, but it stores many parametrized and implemented methods
findable through basic search engines. These instruments have serious limitations and rely on manual
approaches from the point of view of data interoperability and integration. Thus, semantically searchable
collections of workflows oriented to the implementation of research methods and their quality classification
in domains could be useful for communities and for organizing their activities and relationships.


This concept paper related to advanced workflow technologies proposes extending investigations in 
CWFR with an ontological approach to the definition of workflow semantics from points of view of different 
ontologies and different levels of workflow definitions. Three levels of workflow semantics definition are 
used here. The ontological level describes the semantics of workflows, activities, and their elements in terms 
of domain concepts. The data model layer defines the control patterns used in different languages and 
workflow management systems (for example, BPMN, YAWL). At the workflow specification level, data 
transformation specifications and the semantics of research data processed by workflows are defined. The 
ontology-based semantic approach to workflow management provides a semantic search for workflows for 
their reuse and ensures interoperability at all levels. The following sections describe levels of workflow 
description and the way in which these descriptions can be applied for workflow reuse and interoperability. 
Then examples of workflow step patterns for problem-solving and modeling are described.


2. THE PRINCIPLES OF SEMANTIC DESCRIPTION OF WORKFLOWS 


For semantic annotation of workflows, domain ontologies and special ontologies for different purposes 
are used simultaneously. Domain ontologies are necessary to determine the semantics of workflows, 
activities, and places (or inputs and outputs) from the point of view of the research problem being solved. 
They are also used to simplify the possibility to reuse them in the domain. Ontologies related to the research 
lifecycle define concepts of standard or preferred procedures for achieving research objectives. They can 
include such processes as the stages from problem statement to reporting on their solution, the stages of 
transformation and integration of heterogeneous data, the stages of testing hypotheses and theories following 
the scientific method, the stages of applying machine and deep learning, verification of decisions being 
made, and other special approaches to research. The data provenance [8] ontology defines authorship, 
licensing, relevance, provenance, and other non-functional information about workflows and data. Semantic 
annotation can be defined as expressions of concepts of several ontologies to define an annotated object 
from different angles of view. 


A comprehensive formal ontological description and requirements expressed in terms of ontologies allow finding
relevant workflows, their fragments, individual activities, and resources for problem-solving. Means for
collecting and classifying workflows, as well as searching for them by ontological descriptions, are of
fundamental importance. They are necessary to offer experts a way of problem-solving that uses the available
resources and research results of the domain community, and, when verifiable formal approaches are used, they
make automated decisions on the reuse of workflows possible.


3. DEFINITION OF WORKFLOW CONTROLLING PATTERNS 


At the level of workflow data models, it is suggested to use workflow patterns (workflowpatterns.com)
[9]. Control patterns describe metamodels that define constructs providing certain control rules in
workflow languages. For each pattern, the model and semantics of the construct are defined and the workflow
languages that use it are listed. The workflow metamodel ontology defines concepts for various patterns,
allows expressing and annotating workflows in their terms, and supports finding relevant activities taking
into account the semantics of control patterns.


There are patterns with very different semantics. The basic control patterns are sequence, parallel split
(And-Split), synchronization (And-Join), exclusive choice (Xor-Split), and simple merge (Xor-Join). The
synchronization control pattern (Fig. 1) specifies that several branches are joined into one when all input
branches are enabled. In the ontology of workflow control patterns, the AndJoin concept can be defined
as having multiple “inputBranch” relations and a single “outputBranch” relation with the Activity concept.


Figure 1. An example of a synchronization control pattern.


An input of the workflow activity can be annotated in terms of this ontology as an instance AndJoin with 
specified Activity concept (or subconcept) instances in addition to the definition of the research domain 
semantics of the type of this activity input and other descriptions. Specifying workflow annotations in terms 
of this ontology makes it possible to transfer workflows from one system to another or execute them in 
their systems. In any case, their use will meet the specifications. 
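
The AndJoin concept described above can be illustrated with a minimal sketch. The class and field names (`Activity`, `AndJoin`, `input_branches`, `output_branch`) are assumptions that only mirror the “inputBranch” and “outputBranch” relations; they are not taken from any particular ontology or workflow management system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Activity:
    """A workflow activity identified by name (hypothetical model)."""
    name: str


@dataclass
class AndJoin:
    """Synchronization pattern: several input branches, one output branch."""
    input_branches: list[Activity]
    output_branch: Activity
    completed: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # The pattern is only meaningful with several incoming branches.
        if len(self.input_branches) < 2:
            raise ValueError("AndJoin needs at least two input branches")

    def signal(self, activity: Activity) -> None:
        """Record that one of the input branches has been enabled."""
        if activity not in self.input_branches:
            raise ValueError(f"{activity.name} is not an input branch")
        self.completed.add(activity.name)

    def is_enabled(self) -> bool:
        """The output branch is enabled only when all inputs are enabled."""
        return all(a.name in self.completed for a in self.input_branches)


a, b, c = Activity("A"), Activity("B"), Activity("C")
join = AndJoin(input_branches=[a, b], output_branch=c)
join.signal(a)
first_check = join.is_enabled()   # only one branch has arrived
join.signal(b)
second_check = join.is_enabled()  # both branches have arrived
```

In this sketch the join stays disabled after the first branch arrives and becomes enabled only after the second, which is the behavior the synchronization pattern requires of any target system.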


4. DEFINITION OF DOMAIN SEMANTICS OF WORKFLOWS 


Interoperability at the level of workflow specifications is provided using languages of workflow
management systems directly. Annotating activities with a verbal description, simple linking with concepts,
keywords, or terms from a domain dictionary is insufficient for expressing semantics readable for humans
and machines. The annotation should not only determine the domain meaning of workflow elements but
also link them with other elements in semantic descriptions, making it possible to analyze the semantics of
the system or its parts. The description of the semantics of data transformation and integrity requirements
can be provided using preconditions and post-conditions of workflows as a whole and of separate activities.
Types of dataflow can be defined for inputs/outputs (places) between activities. If the expressive power of
the workflow specification languages does not provide capabilities for that, the restrictions are defined only
at the ontological level.
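
As an illustration of preconditions and post-conditions attached to a single activity, the following sketch wraps a data transformation in two checks. The `CheckedActivity` class, the `FilterMeasured` activity, and the record fields are hypothetical; an actual workflow language or ontology would express such conditions declaratively rather than as Python predicates.

```python
from dataclasses import dataclass
from typing import Any, Callable

Condition = Callable[[Any], bool]


@dataclass
class CheckedActivity:
    """An activity guarded by a precondition and a post-condition."""
    name: str
    precondition: Condition
    transform: Callable[[Any], Any]
    postcondition: Condition

    def run(self, data: Any) -> Any:
        if not self.precondition(data):
            raise ValueError(f"{self.name}: precondition violated")
        result = self.transform(data)
        if not self.postcondition(result):
            raise ValueError(f"{self.name}: post-condition violated")
        return result


# Hypothetical activity: drop records that have no measured magnitude.
filter_measured = CheckedActivity(
    name="FilterMeasured",
    precondition=lambda rows: all("id" in r for r in rows),
    transform=lambda rows: [r for r in rows if r.get("magnitude") is not None],
    postcondition=lambda rows: all(r["magnitude"] is not None for r in rows),
)

cleaned = filter_measured.run(
    [{"id": 1, "magnitude": 7.2}, {"id": 2, "magnitude": None}]
)
```

Input that lacks the required `id` field is rejected before the transformation runs, so integrity requirements are enforced at the boundary of the activity.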


In addition to the domain related to the research object, it is desirable to define ontologies related to the 
most general knowledge about performing research, the requirements of the scientific method and open 
science, and others. Research infrastructure collections acquire workflows that are accessible using semantic 
search. One of the types of acquired specifications is research workflow patterns that describe required or 
preferred steps of workflows providing the research lifecycle or specific types of research. Research step 
patterns should be annotated in detail in terms of ontologies of research. The workflow patterns and their 
activities are backed by collections of respective implementations of workflows, services, resources, stored 
in digital object containers. Their annotations are more specialized or equal to the specifications of workflow 
step patterns. In various domains, these implementations may use specialized resources known and available 
in research communities. 


5. SCENARIOS OF APPLYING THE SEMANTIC DESCRIPTIONS OF WORKFLOWS 


During problem-solving, researchers can reuse either workflow patterns, by implementing them, or existing
implementations of methods and processes. Semantic annotations of workflow step patterns can serve as queries
to find their implementations. Workflow patterns, having rich metadata in terms of ontologies, are the key
to creating well-annotated implementations. Formal descriptions as metadata allow applying automated
reasoning for selecting and reusing relevant workflows.


Workflows can be developed with the analysis of requirement models of research problems. Depending 
on the problem statement type (such as data acquiring and analysis, modeling, or machine learning), 
different research step patterns are reused. Partly, the requirement models and workflows are developed 
using the found patterns. Then some fragments of the patterns can be implemented, or relevant existing 
workflow implementations from the collection can be found for them. 


Ontological reasoning allows formal automatic search for relevant workflows or their fragments, verifying, 
and controlling the compliance of substituted parts of workflows and implementations on the semantic 
level. To find and verify resources relevant to requirements, the semantic annotation of the resource must 
belong to an equivalent concept or a subconcept of the concept defined by requirement annotation. So 
relevant research patterns can be found by the problem statement knowledge as subconcepts of requirement, 
or vice versa the pattern step can be proposed if the problem statement mentions its subconcepts. 
Implementations replacing the pattern activities should follow the requirements (to be subconcepts) imposed 
by the patterns and by the problem statements. 
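
The relevance rule stated above (a resource annotation must be equivalent to, or a subconcept of, the requirement concept) can be sketched as a traversal of asserted subconcept links. The concept names and resource IDs are hypothetical, and a real ontology reasoner infers subsumption from concept definitions rather than only following asserted links as this simplified sketch does.

```python
# Hypothetical fragment of a concept hierarchy: concept -> direct parents.
SUBCONCEPT_OF = {
    "CrossMatching": {"EntityResolution"},
    "BinaryStarCrossMatching": {"CrossMatching"},
    "EntityResolution": {"DataIntegration"},
    "SchemaMatching": {"DataIntegration"},
}


def is_subsumed_by(concept: str, requirement: str) -> bool:
    """True if `concept` equals `requirement` or is a subconcept of it."""
    if concept == requirement:
        return True
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for parent in SUBCONCEPT_OF.get(current, ()):
            if parent == requirement:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def find_relevant(annotations: dict[str, str], requirement: str) -> list[str]:
    """Return IDs of resources whose annotation satisfies the requirement."""
    return sorted(
        rid for rid, concept in annotations.items()
        if is_subsumed_by(concept, requirement)
    )


# Hypothetical registry: resource ID -> annotating concept.
registry = {
    "wf-1": "BinaryStarCrossMatching",
    "wf-2": "SchemaMatching",
    "wf-3": "EntityResolution",
}
relevant = find_relevant(registry, "EntityResolution")
```

Here `wf-1` and `wf-3` satisfy the requirement, while `wf-2` does not because schema matching is a sibling of entity resolution rather than a subconcept of it.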


Workflow step patterns may be implemented partially rather than completely if there are no requirements
for full implementation. In that case, after a step is removed, the precondition and post-condition
requirements between the remaining workflow activities should be reconciled using ontological reasoning and
simple transformations so that the fragments of patterns and their implementations are reconnected correctly.


Workflow pattern fragments can be implemented by relevant fragments of existing workflows found in 
registries based on logical inference. Not only the relevance of activities but also the relevance of inputs 
and outputs is evaluated. For the implementation of the activity, its semantics and the types of its inputs 
and outputs defined in semantic annotations are analyzed taking into account all the ontologies used 
including the data model compatibility. The search for activities with relevant types of input and activities 
with relevant types of output is performed in combination with the reachability problem solution between 
them [10]. The compliance of the semantics of the implementation with the requirements of the pattern 
can also be checked taking into account the role chains. An example of required substitution is given below
(Fig. 2). Let the upper workflow be the canonical steps for modeling the research object. It needs substitution
by some implemented workflow fragments. The lower one is a workflow used for substitution. Activity
“Getting Observational Data” (denoted G by its first letter) can be implemented by a workflow fragment
consisting of two activities, “Extract Real-World Object Data” (E) and “Loading and Transforming Data” (L),
if the input type of activity G is a subtype of the input type of activity E (weakened precondition), the output
type of G is a supertype of the output type of L (strengthened post-condition), and L is reachable from E.
Semantic correspondence of the activities themselves can be a non-trivial problem since the activities can use
very different granularity of similar processes. Different methods can be applied for provably refining G by
the fragment E-L or just for evaluating the similarity between them.
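
The three substitution conditions for G and the fragment formed by E and L can be checked mechanically, as the following sketch suggests. The data types assigned to G, E, and L and the type hierarchy are invented for illustration, and reachability is interpreted here as the existence of a control-flow path from the first activity of the fragment to the last.

```python
from collections import deque

# Hypothetical data-type hierarchy: subtype -> direct supertypes.
SUPERTYPES = {
    "CatalogQuery": {"DataRequest"},
    "CleanObservations": {"Observations"},
}

# Hypothetical activities: name -> (input type, output type).
ACTIVITIES = {
    "G": ("CatalogQuery", "Observations"),     # Getting Observational Data
    "E": ("DataRequest", "RawRecords"),        # Extract Real-World Object Data
    "L": ("RawRecords", "CleanObservations"),  # Loading and Transforming Data
}

# Control flow of the candidate workflow: activity -> direct successors.
FLOW = {"E": ["L"], "L": []}


def is_subtype(sub: str, sup: str) -> bool:
    """True if `sub` equals `sup` or is a (transitive) subtype of it."""
    stack, seen = [sub], set()
    while stack:
        current = stack.pop()
        if current == sup:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(SUPERTYPES.get(current, ()))
    return False


def reachable(flow: dict[str, list[str]], start: str, goal: str) -> bool:
    """Breadth-first search for a control-flow path from `start` to `goal`."""
    queue, seen = deque([start]), {start}
    while queue:
        current = queue.popleft()
        if current == goal:
            return True
        for nxt in flow.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def can_substitute(step: str, first: str, last: str) -> bool:
    """Check whether the fragment first..last may implement `step`."""
    step_in, step_out = ACTIVITIES[step]
    first_in = ACTIVITIES[first][0]
    last_out = ACTIVITIES[last][1]
    return (
        is_subtype(step_in, first_in)        # weakened precondition
        and is_subtype(last_out, step_out)   # strengthened post-condition
        and reachable(FLOW, first, last)     # fragment is connected
    )


fragment_ok = can_substitute("G", "E", "L")
reversed_ok = can_substitute("G", "L", "E")
```

The check covers only the typing and connectivity conditions; the semantic correspondence of the activities themselves, which the text identifies as the harder problem, is outside the scope of this sketch.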


[Figure 2 diagram. Activity labels recoverable from the figure: Knowledge Collection, Hypothesis Generation,
Model Generation, Getting Observational Data, Result Comparison, Extract Real-World Object Data, Loading and
Transforming Data, Modeling Physical Parameters, Inverse Problem Solving, Evaluation of Model.]

Figure 2. A workflow substitution for canonical steps of modeling. 


The described approach can take into account the implementation of workflows in different workflow
management systems, the organization of workflow collections in data infrastructures, and the search for
them. It can be applied to the semantic selection of workflows and resources in the
research domain, the implementation of research step patterns by reusing fragments of existing workflows,
and the automation of problem-solving based on the reuse of workflows.


6. A CANONICAL RESEARCH WORKFLOW EXAMPLE 


In the frame of the investigations in research infrastructures, a lifecycle for problem-solving was 
developed [11]. It includes setting a problem as requirement specifications, describing requirements in 
terms of domain knowledge, searching for relevant data sources, integrating them, operationalizing 
requirements, searching or implementing methods and workflows, and experimenting. In addition, at 
different stages of the lifecycle, formal verification of the performed stages can be provided and the results 
of finished stages can be published in the domain community. The integration stage can be specified as a 
sequence of three data management tasks: data model (data definition and manipulation language) 
reconciliation, then schema-matching, and finally entity resolution in data from different sources (Fig. 3). 


[Figure 3 diagram. Canonical steps: Requirement Modeling, Knowledge Collection, Data Source Selection, Data
Integration, Operationalization, Method Implementation, Experimenting. Sub-steps of Data Integration: Data
Model Unification, Schema Mapping, Entity Resolution.]


Figure 3. The data-driven problem-solving lifecycle canonical steps. 


Another approach to experimenting especially in the case of investigating unobservable physical 
parameters of the research object is modeling it (see the canonical steps in Fig. 2). Hypotheses are generated 
using domain knowledge. A model is generated according to that knowledge and the hypotheses. Then 
distributions of modeled parameters are compared to observed ones. These recurring approaches to research 
on data are represented as steps of canonical workflows. 


The shown examples of canonical research patterns are domain-independent. A lot of domains use 
similar approaches to problem-solving lifecycle beginning with gathering data from multiple sources then 
integrating them into the information system, identifying the same entities in their data, and applying 
methods to the consolidated data to solve the problem and publish the result. As well, modeling real-world 
objects and comparing them to observed data from them is known as a way to check research hypotheses. 


In [12], the solution to astronomical problems was presented, which can be an example of the problem- 
solving lifecycle application with data reuse at different stages. The first described problem was finding 
hierarchical multiple stellar systems among data on binary stars. The requirement model of this problem 
was created as a decomposition tree. The domain knowledge of binary and multiple stars was accumulated 
in the domain ontologies and conceptual schemas for domain data representation. The ontologies and 
schemas were published for further reuse. By evaluating ontological relevance, heterogeneous catalogs of
binary stars were integrated into the conceptual schema, and their structural heterogeneity was resolved. The
results of integration were published to be used in the research community. Then, the algorithm of binary
and multiple stellar system cross-matching was performed, in which observed parameters of stars were used
to identify the same stellar systems in data from different catalogs. The list of identifications was published
as a new catalog [13]. Finally, the identified systems were analyzed for compliance with the hypothesis of
hierarchical stellar systems (see Fig. 4). All steps of solving the problem can be represented as
implementations of the research lifecycle workflow step patterns. The ontological descriptions of these
implementations comply with the descriptions of the step patterns and refine them. So the implementations
can be registered using those formal descriptions and can later be found using the requirements of the step
patterns in combination with some problem requirements.


[Figure 4 diagram. Steps: Hierarchical Stellar System Problem, Binary Star Knowledge Collection, Binary Star
Catalog Selection, Catalog Schema Matching, Binary Star Cross-Matching, Multiple System Identification,
Finding Hierarchical Systems.]


Figure 4. The steps of solving the problem of finding hierarchical stellar systems. 


After the problem of hierarchical stellar systems had been solved, work began on another problem, that of
creating a Galaxy model of binary stars. The published results of solving the previous problem in the same
domain were partially reused. Almost all stages used those results, including the ontologies, the schemas,
the integrated catalogs of binary stars, and the list of cross-matched binary systems. At the same time,
hypotheses about the distributions of binary stars in the Galaxy were formed using domain knowledge and
relevant publications. Following the hypotheses, the Galaxy models of binary stars were generated. Then,
to select the best hypotheses, the distributions of binary star parameters according to the data from catalogs
were compared to the distributions of stars in the visible part of the generated Galaxy models. For more
details on data reuse, see [12].


The solution to these research problems can be presented as refinements of the steps of the problem- 
solving canonical workflow in the domain of binary stars combined with the canonical steps of modeling. 
And the published results of solving the first problem could be found and reused for solving the second 
problem (Figure 5). 


[Figure 5 diagram. Steps: Binary Galaxy Modeling Problem, Binary Star Knowledge Definition, Binary Star
Catalog Selection, Catalog Schema Matching, Binary Star Cross-Matching, Distribution Evaluation in Observed
Data, Binary Star Hypotheses Generation, Galaxy Model Generation, Distribution Evaluation in Modeled Data,
Galaxy Model Selection.]


Figure 5. The steps of solving the problem of binary star Galaxy creation. 


• The step of creating a problem requirement model can lead to the detection of common sub-steps of
the problems.

• At the knowledge specification step, the requirement model can refer to the domain ontologies and
published data schemas. Some new concepts should be defined and can be published to enhance
the domain knowledge if the community confirms their commitment.

• The ontologies can be used to search for relevant registered catalogs.

• The step of data model integration can be skipped since the catalogs use the same principles of record
representations. It can be verified that the empty activity can just transfer the schemas to the next
activity.

• At the schema matching step, the results of integrating the binary star catalogs can be retrieved and
reused.

• At the entity resolution step, the algorithm for cross-matching binary stars can be partially reused for
two-component stars only.

• At the operationalization step, methods for compiling data on star systems, hypotheses and methods
for generating models, and methods for evaluation of stellar parameter distributions in the galaxy
should be developed since there were no relevant existing methods found. The developed specifications
can be published.

• At the experimenting step, experiments should be performed to generate different models and compare
stellar parameter distributions to observed ones. The results of the research should be published.


Such refinements of the canonical workflow become possible as a result of referring the steps to concepts 
of domain ontologies and searching for relevant implementations for them in the library of contextual steps 
and the published results of previous research. 


The considered examples solve specific problems in astronomy; however, most research domains have
the necessary resources for the proposed approaches. Similar principles of distinguishing types of research
objects and observable and computable parameters can be found in almost any area, including astronomy 
and astrophysics, materials science, biomedicine, earth sciences, social science, and others. This is reflected 
in existing domain models, in which the same basic principles of research are significantly duplicated. The 
most general knowledge in these areas can be reduced to common ontologies, data and metadata schemas, 
standard methods, and processes. 


7. ISSUES OF IMPLEMENTATION WITHIN THE CANONICAL WORKFLOW FRAMEWORK 


First of all, the proposed ontological approach can be implemented if the chosen implementations of 
CWFR support semantic annotations in terms of ontologies. For example, elements of workflows can be
supported with expressions of concepts so that individuals of ontological concepts can store their identifiers 
to refer to relevant activities, places (inputs and outputs), and types. 


The following references can be used: 


• Canonical steps refer to concepts of the ontology determining the meaning of a step in the research
lifecycle.


For example, the entity resolution step of data resource integration can refer to the concept “entity
resolution” in an ontology of data management and data-driven research.


• Contextual packages of steps also refer to the ontology of the context domain.


For example, the package implementation of the entity resolution step in astronomy can refer to the
concept of “cross-matching”, which is restricted to be applied only to data on astronomical objects and
can be a subconcept of “entity resolution”.


• The implementations of certain activities specific to the research domain refer to ontological
expressions that reflect their precise semantics in the domain.


For example, an implementation of the activity of binary star cross-matching can be a very complex
problem requiring research and making a separate sub-workflow. Both the activity and the sub-workflow
can be semantically defined by the expression describing the cross-matching of binary stars only. Such an
expression defines a concept that does not belong directly to the ontology but precisely defines the semantics
of the described activity.


• The places of workflow modeling patterns correspond to the inputs and outputs of the canonical
steps. They should be described by concepts of the workflow modeling pattern ontology.

• The places of contextual package steps and workflow implementations should additionally be
described in terms of domain ontologies to constrain data types. They also define the types of input
attributes of activities and the types of digital objects created as a result of the activities.

• The library of contextual packages can contain general steps related to domain types. Besides this,
the activities of all published workflow implementations for solving specific research problems should
be classified in terms of corresponding domain ontologies.


The search for relevant reusable workflow implementations can be organized based on queries to the 
ontology. The requirements of the query actually should be defined as a concept by which referred workflows 
and their fragments are classified. The query can simultaneously use constraints of several ontologies: for 
example requirements from the point of view of the domain, defined workflow element kinds, the provenance 
of the workflow, and others. Thus, the IDs of all relevant fragments are retrieved by the query for reuse. 


During the reuse of existing implemented activities, steps, or workflows, the digital objects are 
supplemented with metadata in terms of the provenance ontology to track what data were processed, where 
they came from, which workflows processed the data, which agents executed the workflows and processed 
the data, when the data was processed, how long it is relevant, and so on. 
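
The provenance questions listed above can be pictured as fields of a single record attached to a digital object. The sketch below is only an illustration: the class, field names, and identifiers are hypothetical and are not terms of the provenance ontology referred to in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance metadata attached to a digital object."""
    data_id: str              # what data were processed
    source: str               # where they came from
    workflow_id: str          # which workflow processed the data
    agent: str                # which agent executed the workflow
    processed_at: datetime    # when the data were processed
    relevant_for: timedelta   # how long the result stays relevant

    def is_relevant(self, moment: datetime) -> bool:
        """True while `moment` falls inside the relevance period."""
        return self.processed_at <= moment <= self.processed_at + self.relevant_for


record = ProvenanceRecord(
    data_id="do:catalog-xmatch-001",
    source="do:binary-star-catalog-A",
    workflow_id="wf:binary-star-cross-matching",
    agent="agent:researcher-1",
    processed_at=datetime(2022, 1, 10, tzinfo=timezone.utc),
    relevant_for=timedelta(days=365),
)
still_relevant = record.is_relevant(datetime(2022, 6, 1, tzinfo=timezone.utc))
expired = not record.is_relevant(datetime(2023, 6, 1, tzinfo=timezone.utc))
```

Keeping the record immutable reflects the idea that provenance describes what has already happened to a digital object and should not be edited afterwards.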


As for the selection of possible workflow management systems for CWFR implementation, different
workflow environments may include semantic descriptions of both the resources used in processes and the
processes themselves. RO-Crate supports semantic metadata based on schema.org types and similar
vocabularies that can be linked to Research Object containers and resources in them [14]. UIMA defines
types and features identified during data analysis that can be associated with RDF resources. It allows
relating metadata with resources and workflows through resource specifiers for organizing registries of
reusable resources [15]. There are investigations related to the ontological description of the OPC UA [16]
specifications, which abstractly describe nodes and services for organizing data exchange between them.
Jupyter Notebook workflow extensions can be used if any definition of a workflow can be identified to link
metadata. These and other workflow management tools can be supplemented with instruments for describing
resource semantics and semantic search. All an object needs in order to be described with semantic metadata
is an identifier. Ontology-supporting instruments can be independent of the workflow management tools
being used. They store both domain definitions and metadata that describe resources and refer to the
resources through identifiers. If possible, workflow specifications should keep references to the metadata
as well. By semantic metadata, relevant resources can be found and reused.


The proposed prototype architecture, which implements the approach as a complement to possible 
architectural decisions in CWFR, is shown in Figure 6. A resource annotation service allows resources 
such as source data, programs, workflows, and resulting data to be annotated in terms of domain 
models and some special models. Annotations must be linked to the corresponding resource identifiers 
and kept in the triple store as RDF/OWL specifications. They are considered a part of the specified 
digital objects, so links to the annotations as metadata are stored in the digital objects too. Annotations 
of workflows that are yet to be implemented can be treated as simple queries for relevant resources with 
similar annotations. A query building service is needed to express requirements on resource 
compositions such as workflow fragment structures. The SPARQL endpoint is used to search for 
relevant resource IDs. The retrieved resources can be considered candidates for reuse in a consistent 
implementation of workflows. 
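
The interaction of the annotation and query services described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory TripleStore class stands in for a real RDF triple store and SPARQL endpoint, and the predicate name annotatedWith, the resource identifiers, and the concept names are all hypothetical.

```python
# Minimal sketch: annotate resources, keep the triples in a store, query for IDs.
# The store, the "annotatedWith" predicate, and all identifiers are hypothetical.

class TripleStore:
    """A toy in-memory stand-in for an RDF triple store."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [
            (s, p, o)
            for (s, p, o) in self.triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)
        ]


def annotate(store, resource_id, concepts):
    """Annotation service: link a resource identifier to domain concepts."""
    for concept in concepts:
        store.add(resource_id, "annotatedWith", concept)


def find_relevant_ids(store, required_concepts):
    """Query service: IDs of resources whose annotations cover all required concepts."""
    candidates = None
    for concept in required_concepts:
        ids = {s for (s, _, _) in store.match(predicate="annotatedWith", obj=concept)}
        candidates = ids if candidates is None else candidates & ids
    return sorted(candidates or [])


store = TripleStore()
annotate(store, "wf:crossmatch-01", ["astro:CrossMatching", "astro:Catalog"])
annotate(store, "wf:photometry-02", ["astro:Photometry", "astro:Catalog"])
annotate(store, "data:catalog-a", ["astro:Catalog"])

# The annotation of a workflow step that is yet to be implemented acts as the query.
relevant = find_relevant_ids(store, ["astro:CrossMatching", "astro:Catalog"])
print(relevant)
```

Here the annotation of a not yet implemented step is used directly as the search condition, and only identifiers are returned, so the resources themselves can stay in whatever workflow management system holds them.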


[Figure 6 diagram. Labels recoverable from the source: Queries; Annotating; SPARQL Endpoint; Triple Store (Domain Models, Resource Metadata); Resource IDs; Relevant IDs; Annotations; Annotations & IDs; Links to metadata; and a label truncated in the source, ending in "Workflows, Results)".]


Figure 6. The prototype architecture. 


8. FOLLOWING THE PRINCIPLES OF FAIR DATA 


The ontological approach to providing the reuse of canonical workflow steps and implemented workflows 
was motivated by the principles of FAIR data. 


To provide the findability of workflows and processed data, the workflows and all their components are 
described comprehensively and formally with semantic annotation metadata in terms of domain 
ontologies and some special ontologies. Workflows are classified using ontological reasoning [17] over 
those metadata. Publishing a workflow consists of defining its semantic annotations and registering it 
by classification. Query expressions in terms of ontological concepts are used to search for relevant 
workflows and their fragments. 
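
Classification by reasoning and the subsequent search by concept can be illustrated with a deliberately simplified sketch. It assumes a small acyclic, single-parent concept hierarchy and computes only transitive subclass closure, which is far weaker than the ontological reasoning referred to above. All concept names and workflow identifiers are hypothetical.

```python
# Sketch of classification by subsumption; the concept hierarchy is hypothetical
# and is assumed to be acyclic with at most one direct superclass per concept.

SUBCLASS_OF = {
    "astro:CrossMatching": "research:DataIntegration",
    "astro:Photometry": "research:Measurement",
    "research:DataIntegration": "research:ResearchStep",
    "research:Measurement": "research:ResearchStep",
}


def ancestors(concept):
    """All superclasses of a concept, following subclass links transitively."""
    result = []
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        result.append(concept)
    return result


def classify(annotations):
    """Map each workflow ID to its asserted concept plus every inferred superclass."""
    return {
        workflow_id: {concept, *ancestors(concept)}
        for workflow_id, concept in annotations.items()
    }


def search(classified, concept):
    """Return the IDs of workflows classified under the given concept."""
    return sorted(wf for wf, concepts in classified.items() if concept in concepts)


annotations = {
    "wf:crossmatch-01": "astro:CrossMatching",
    "wf:photometry-02": "astro:Photometry",
}
classified = classify(annotations)
print(search(classified, "research:DataIntegration"))
print(search(classified, "research:ResearchStep"))
```

A query for a general concept thus also finds workflows annotated only with a more specific one, which is the practical benefit of registering workflows by classification rather than by literal annotation.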


Data accessibility is provided by linking ontological annotations to the identifiers of workflows and their 
components and returning those identifiers in query results. A SPARQL [18] endpoint interface and web 
protocols could be used for this. 
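
As a hedged illustration of access over web protocols, the sketch below builds a SPARQL SELECT query and encodes it as an HTTP GET request using the `query` parameter defined by the SPARQL 1.1 Protocol. The endpoint URL, the vocabulary prefix, and the annotatedWith property are assumptions made for the example and do not come from the prototype.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint URL and vocabulary; neither is part of the prototype.
ENDPOINT = "https://example.org/sparql"


def build_query(concept_iri):
    """Build a SPARQL SELECT query for resources annotated with a concept."""
    return (
        "PREFIX ex: <https://example.org/vocab#>\n"
        "SELECT ?resource WHERE {\n"
        f"  ?resource ex:annotatedWith <{concept_iri}> .\n"
        "}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request using the standard 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


query = build_query("https://example.org/vocab#CrossMatching")
url = build_request_url(ENDPOINT, query)

# Decoding the URL recovers the original query text, as an endpoint would.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

The sketch stops at constructing the request. Sending it and parsing the returned identifiers would depend on the endpoint actually deployed.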


Interoperability of workflows and data in the proposed approach is provided by the use of a common 
formal knowledge representation model [19] for all specifications, including domain models, workflow 
language models, research step models, and provenance models. Automatic logical reasoning over 
semantic annotations allows searching for workflows and checking their relevance and meaning with 
respect to the ontologies. Reasoning provides the ability to interpret the meaning of data and resources when 
solving problems by both a human and a machine. Formal ontologies are the kind of vocabulary that best 
fulfills the FAIR data principles. 


Reusability is provided by supporting domain ontologies, which may describe any other 
domain-specific standards. In addition, the use of special ontologies allows defining non-functional 
requirements for data and workflows such as data provenance, data quality, and other aspects. 


9. DISCUSSION 


The proposed approach is naturally suited to a cross-disciplinary space. It is not intended for a 
specific domain and would be less effective in a closed environment. 


On the one hand, domain communities, which become smaller as their area becomes more specific, work to 
define the knowledge of their domains. Such communities use the knowledge of more general communities 
and domains. Most likely, a domain already has a groundwork that defines its concepts, common 
conceptual schemes, methods, and tools used by the community, as well as known typical processes for 
solving various tasks. Today, domain models for semantic approaches are being intensively developed and 
discussed at specialized conferences and workshops devoted to the problems of particular research 
domains. 


On the other hand, in the cross-disciplinary space, there are standards followed by almost every domain 
community, as well as the most common domains, which are used by everyone in the discipline. All this 
knowledge can and should be reused in communities. 


For example, there are no commonly applied formal ontologies in astronomy, but the Unified Content 
Descriptors (UCD) [20] standard is a widely used instrument for linking resources with astronomical domain 
concepts. Common domains of knowledge that are used in all subdomains and in solving almost any 
problem are astrometry with the celestial coordinate system, multicolor photometry, and classifications of 
astronomical objects and stars. Almost any task in this domain includes a subtask of cross-matching data 
about astronomical objects in different catalogs. Standards of the International Virtual Observatory 
are also widely used. 


Finally, the most general knowledge is interdisciplinary and is shared by everyone involved in research. 
Examples are the ontologies and metadata of data provenance, of measurements and their accuracy, and of 
experiments at the physical and informational levels. Communities of any domain can have common 
standards or their own standards for these issues. Most research areas distinguish research objects of various types, sets of 
observable parameters that can be measured by observation instruments, and sets of parameters that cannot 
be observed directly but can be estimated based on the observed parameters. The type of the research 
object may depend on the composition of parameter values, or, on the contrary, the type of object limits 
the possible parameter values. All this knowledge can have common specifications across domains. 


The research skills of any domain researcher are sufficient to relate resources to domain concepts, or even 
to define constraints combining several concepts to express the semantics of resources. If there are some 
arrangements for semantic modeling in the domain, they can naturally be used to describe resources. 
Domain researchers should document all resources they use, including source data, 
schemas, workflows, the activities in them, data at the inputs and outputs of activities, and result datasets, in terms 
of their native domain concepts. This is often not done because there are no requirements to provide such 
descriptions. However, data management plans could include such requirements. Given detailed 
annotations of resources with domain knowledge, a machine can offer the most relevant resources to 
substitute into and implement workflows and to solve domain problems. 


Thus, the proposed approach is community-driven rather than domain-driven. To stay alive, a community 
should define its domain, useful commonalities, and the shared resources in them, and reuse resources available 
in its domain and in a wider context. To implement an interdisciplinary environment, it is necessary to support 
domain communities and include them in more general communities with access to common knowledge. 
Each community organizes its part of the resource registry: domain ontologies, schemas, programs and 
libraries, and processes. The semantics of resources should be described in terms of domain concepts to classify 
them in the domain. To reuse resources, it is necessary to provide search in the registry by domain 
concepts, access by retrieved identifiers, and support for the integration and interoperability of the found resources. 


10. CONCLUSION 


A semantic approach to workflow search, implementation, and composition in the context of investigations 
of canonical research workflows has been presented. Specifying workflow model semantics has been 
proposed to support the interoperability of different workflow systems and the execution of workflow fragments 
in different management systems. Specifying the domain semantics of workflows using ontologies can be 
applied to organizing workflow classification and relevant workflow search, correct pattern implementation, 
and workflow fragment integration and reuse. Examples of patterns applicable to any research domain and 
examples of their implementation for solving specific problems in astronomy were described. The principles 
of applying the presented approach to CWFR concepts and a prototype architecture implementing 
this approach as a complement to possible architectural decisions in CWFR are proposed. 